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Abstract 

The statistical problem for network tomography is to infer the distribution of X, with 
mutually independent components, from a measurement model Y = AX., where A is a 
given binary matrix representing the routing topology of a network under consideration. 
The challenge is that the dimension of X is much larger than that of Y and thus the 
problem is often called ill-posed. This paper studies some statistical aspects of network 
tomography. We first address the identifiability issue and prove that the X distribution 
is identifiable up to a shift parameter under mild conditions. We then use a mixture 
model of characteristic functions to derive a fast algorithm for estimating the distribution 
of X based on the General method of Moments. Through extensive model simulation 
and real Internet trace driven simulation, the proposed approach is shown to be favorable 
comparing to previous methods using simple discretization for inferring link delays in a 
heterogeneous network. 

Keywords: Network tomography, identifiability, characteristic function, mixture model 

1 Introduction 

Network performance monitoring and diagnosis is challenging due to the size and decentralized 
nature of the Internet. The service providers may collect their link level statistics using tools 
such as Cisco Netflow, whereas the end users can obtain the end-to-end performance by prob- 
ing the network. Unfortunately, none of them has a global view of the Internet. For instance, 
when an end-to-end measurement indicates the performance degradation of an Internet path, 
the exact cause is hard to be uncovered because the path may traverse several autonomous 
systems (AS) that are often owned by different entities and the service providers generally do 
not share their internal performance. Even if they do, there is no scalable way to correlate 
the link level measurements to end-to-end performance in a large network like the Internet. 
Similarly, the service providers may be interested in the end-to-end path characteristics that 
they can not observe directly. Network tomography is a technology addressing these issues 
that infers unobservable characteristics from easily available measurements. There have been 
two forms of network tomography being studied in the literature. One, called network delay 
tomography, estimates the link-level characteristics based on end-to-end measurements, and 
the other, called traffic demand tomography, predicts end-to-end path-level traffic intensities 
based on link-level traffic measurements. The key advantage of network tomography is that 
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it does not r e quire t he collaborat i on be tween network internal elements and end users. See 
Castro et all ^2004 ). IPenbv et al] (|2007l ^ and references therein for an excellent review. We 
focus on network delay tomography in this paper, while the proposed approach may also be 
applied to traffic demand tomography. 

Network delay tomography aims to estimating network internal characteristics such as loss 
and dela}0, from end-to-end measurements by exploiting the inherent correlation in perfor- 
mance. Considering a tree spanning a source of probes (root) and a set of receivers (leaves), 
the packets are potentially subject to queuing delay and loss at each link. The end-to-end 
(source-to-receiver) measurements may be made passively or actively. The p robes for the ac 



five m easur ements can b e sent u sing either multicast or unicast routing. See Lawrence et al 



tood ) and IPenbv et al] ^2Q0i \ for examples of how unicast and multicast probes can be 
designed and sent. Because only one copy of a probe is transmitted on the common links, 
multicast probing based tomography has the advantag e of perfect correlat ion o n the common 
l inks, less overhead, and better scalability. Following iPresti et al.l (120021 ) and iLiang and Yu 
(|2003l ^. we assume that measurements are collected from multicast probes, al though the mul - 
ticast routing is not widely enabled in today's Internet. It has been shown in lBu et al.l (120021 ) 
on how to apply the tomography algorithms developed for multicast measurements when only 
unicast measurements are available. 

The statistical models for both types of network tomography can be unified as follows: 



AX, 



(1) 



where X = {Xi, . . . , Xj) is a J-dimensional vector of network dynamic parameters, and 
Y = (Yi, . . . , 1/)^ is an /-dimensional vector of measurements and ^ is an / x J matrix with 
elements or 1 which represents the routing topology of the network under consideration. 
Here we use the superscript T to denote the transpose. In most network tomography scenarios, 
the components of X are assumed independent but unobservable. Usually / can be as large as 

for network demand tomography and as large as 2 J — 1 for network delay tomography. In 
network delay tomography, each component of X represents an internal link delay and each 
component of Y represents a delay measurement from a source to a destination. The objective 
of network tomography is to estimate the distribution of X given independent observations 
from the distribution of Y. 

As a simple example. Figure [U shows a two- leaf tree topology, on which a probing packet is 
sent from the root node (source) to leaf nodes 2 and 3 (receivers). When the packet arrives 
at node 1 from the source, it is replicated and transmitted to node 2 and 3 simultaneously as 
the red arrows show. Let Xi denote the link delay from node to node 1 and let Xi denote 
the link delay from node 1 to node i, i = 2,3 respectively. Let li, be the end-to-end delays 
from node to node 2 and 3 respectively. Then Yi = Xi + X2 and Y2 = Xi + X^ , which can 
be written in the form of ([T]) with A a 2 x 3 binary matrix, i.e. A = [1, 1, 0; 1, 0, 1]. 

There have been significant amount of works on net work tomography in recent years. 
Network tomography wa s first proposed by Vardil (1996) and t hen fo llowed by remarkably 
Tebaldi and West! (|l998l l. ICao et al.l toO&i and iLiang and Y^ (I2OO3I) for traffic matri x es- 
timation, i.e. traffic demand tomography. Caceres et al. ( 19991 ). Zhu and Ceng ( 20051 ) and 



^To be precise, the delay here is the queuing delay that excludes the constant link propagation delay, we 
omit queuing when context is clear 

^With multicast, a packet is sent from a source to multiple destinations simultaneously; with unicast, a 
packet is sent to different destinations separately 
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Xi et al.1 ^2m(i ) among others studied it for ir iferring network iii ternal loss. Network delay 



tomography has also been studied extensively . Presti et al. ( 20021 ) developed a fast algebraic 
algorithm but it is quite inefficient. iBu et al.l (l2002l ) showed tha ,t the maxini u m lik elihood es- 
timate (MLE) requires exponential comp utational comp l exity. Tsang et al. ( 20031 ) proposed 
a penalized maximum likelihood method. iLiang and Yul (1200311 propo s ed a p seudo-likelihood 
method with multicast measurements, and recentlv iLawrence et al. ( 20061 ) proposed local 
likelihood method with both unicast and multicast measurements, both of which were shown 
to be fast and quite efficient compared with the MLE. These studies are based on a discrete 
distribution with equally spaced bins for modeling link delays, where the same bin width is 
use d for all the links for the ease of computation. 

Duffield et al.l (|200lh pointed out that, however, a single fixed bin width is not appropriate 



for heterogeneous networks such as the Internet because it does not scale well between both 
fast links and slow links. They proposed a varying-bin discrete model for estimating link delay 
distributions based on unicast measurements. Their estimation idea is to use structured bins 
such that they can iteratively estimate a segment of delay distributions by truncating the 
delays from both sides, i.e. rounding the left of the segment to zero and the right to infinity. 
However, the performance of their estimation approach is not better than that using an 
equal-bin discrete model with an appropriate bin width as they reported, probably due to the 
bias introduced by their brute-force truncation. Our approach i s also based on varying-bin 
type models but does not suffer from such bias. Shih and Hero ( 200ll ) proposed to estimate 
cumulative generating functions (similar to characteristic functi ons used in th is pap er) of 
link delays, but they did not estimate link delay distributions. Shih and Herd ( 20031 ) also 
proposed finite mixture models with Gaussian components for link delay distributions based 
on unicast measurements. 

There are several previous works that have c onside r ed the ide n tifiab ility i ssue for the 
netwo rk tomography problem, for example Vardi ( 19961 ). Cao et al. ( 2000l ) and Presti et al. 
(I2OO2I ). These authors considered instances of the tomography problem by assuming specific 
parametric (such as Poisson and Gaussian) or discrete distributions. We will unify these re- 
sults and extend the identifiability condition to general distributions under mild assumptions. 

The contributions of this paper are as follows. First, we prove that the distribution of 
X is identifiable up to a shift parameter under general conditions. Second, we propose flex- 
ible mixture models of characteristic functions for network delay tomography and develop a 
fast algorithm for estimation based on the General Method of Moments (GMM). The new 
approach allows one to model continuous delays on heterogeneous network links conveniently, 
where delays may not have the same scale across all network links. Extensive model sim- 
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ulation and real Internet trace-driven simulation suggest that our new approach can yield 
more accurate estimates of link delay distributions yet is computationally less expensive than 
previous approaches. 

The remaining sections of the paper are structured as follows. In Section 2, we address the 
identifiability issue. We describe the mixture models for link delays in Section 3 and develop 
a fast algorithm for estimating the delay distributions in Section 4. In Section 5, we present 
extensive experimental studies for evaluating the proposed method. Section 6 concludes the 
paper. 

2 Identifiability 

In this section, we study the identifiability issue for model ([T]) and prove that the distribution 
of X is identifiable up to a shift parameter under mild conditions. The main tool we use is 
characteristic function whose basic properties are reviewed below. 

2.1 Characteristic Function 

A characteristic function of a univariate random variable Z is defined by 



where E\\ denotes the expectation with respect to Z and /z(-) is the probability density 
function of Z . By convention, and fz denote the characteristic function and probability 
density function of Z, respectively. The characteristic function for a random vector Z G VP 
can be defined in a similar manner by considering t G nP instead of TZ. It is well known 
that a probability distribution can be uniquely specified by its characteristic function and 
vise versa. 

Suppose Z\ and Z2 are two independent random variables. Then the joint characteristic 
function of Z = (Zi, Z2) is 



which is a product of the marginal characteristic functions of Z. Let V = Zi + Z2, then the 
characteristic function of V is simply a product of the characteristic functions of Zi and Z2, 
i.e.. 



This is much easier to compute than the density function of V , say which is a convo- 

lution of densities of Zi and Z2, i.e., 



For the tomography model ([T]), since the components of X are mutually independent, it is 
easy to evaluate the characteristic function of Y by 




4>7.{t) = 4>zAti)4>z2{t2), t 



(ii,i2), 



(t)v{t) = ^zAi)4>z,{t)- 




J 



(2) 
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where is the jth column of A. However, it is in general difficult to evaluate the distribution 
of Y because it is a high order convolution in terms of the distribution of X. Below we will 
use the formula ([2]) for both the identifiability proof and estimation in network tomography. 



2.2 Identifiability 

By identifiability, we mean that the distribution of X can be uniquely determined by the 
distribution of Y. It is important to establish the identifiability. Otherwise, the distribution 
of X may not be estimable from the distribution of Y. In the following, we present our 
general theorems for identifiability and discuss related issues. 

We assume that -E'l-'^jl < oo, j = 1, • • • , J and that the distribution of X satisfies one of 
the two conditions, namely CI and C2, defined below. 

(CI) the characteristic function of each Xj is analytical; 

(C2) the characteristic function of each Xj has no zeros in TZ. 

We first address the identifiability issue in Lemma 1 for the simple two-leaf tree tomogra- 
phy model described earlier with Figure [TJ The result will serve as the basis for Theorem 1 
and 2 below where the routing topology is not a simple two-leaf tree. 

Lemma 1. // Yi = Xi + X2 and Y2 = Xi + X3, then the distributions of Xi, X2, X3 can be 
identified up to a shift parameter. 

Proof. Suppose there exist both X = {Xi, X2, X^)"^ and X' = {X[, X2, X'^)'^ with mutually 
independent components that give rise to the same distribution Y = {Yi,Y2), then we show 
that distributions of Xj and Xj, j = 1, 2, 3, are the same up to a shift parameter. By ([2]), we 
have for t,s £ TZ, 

cl>xdt + s)(Px2it)(l)xAs) = cl)x[it + s)(Px^{t)(j)x;^{s). (3) 

Notice that (pj{t) = log(j)Xj{t)/4'x'.(t) is well defined in a neighborhood of the origin with 
ipj{0) = 0, j = 1, 2, 3. Thus for t and s in a neighborhood of zero, 

ifiit + s) + ip2{t) + y^sis) = 0. 



By using the argument of finite differences (c.f. Lemma 1.5.1 of Kagan et al. ( 19731 )). each 



is a linear complex function in a neighborhood of zero and thus in TZ with the given condition. 
That is, there exist complex numbers aj,bj such that (j)Xj{t) = (^)c"^ for any t £ TZ. 
By evaluating both sides at t = 0, = 0. By taking the first order derivative on both sides 
at zero, iE[Xj] = iE[Xj] + ibj and thus bj e TZ, due to Xj,Xj £ TZ. Hence Xj and Xj + bj 
have the same distribution. Further, AE[K] = AE[K'] implies b2 = b^ = —hi. □ 

For network delay tomography, as a generalization of the sin iple two-leaf treemodel, let 
A correspond to a routing matrix derived from a multicast tree ( Presti et al. ( 20021 )). where 



each node, except for the root and leaves, must have at least two children. Let us take the 
four-leaf tree in Figure [2] as an example of a multicast tree, which will be used for simulation 
purposes later. Let Xi, - ■ ■ ,^7 denote the link delays on the edges from top to bottom and 



^An analytic characteristic function corresponds to a distribution function which has moments nik of all 
orders k and limsupi^_^^[\mk\/k\]^^'' is finite 
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from left to right in the tree, i.e., the hnk delay on the edge with end node j is denoted by Xj. 
Let Yi, • • • ,Yi denote the end-to-end delays from the root node to end node 4, 5, 6 and 7, 
respectively. Then each element of Y = (Yi, • • • , 14)"^ is a partial sum of X = {Xi, • • • , Xj)'^ , 
for example, Yi = Xi -\- X2 + X^. This can be written in the form of ([T]), where yl is a 4 x 7 
binary matrix and can be derived from the linear equations. From Lemma[Tl the distributions 
of X4,X5 are determined up to a shift parameter by the joint distribution of {Yi,Y2), so are 
the distributions of Xq,X-j. Using a bottom- up induction on the tree, it follows that the 
distributions of all components of X are determined by that of Y up to shift ambiguity. The 
same arguments leads to the following theorem. 

Theorem 1. Let A he the routing matrix derived from a multicast tree, then the distribution 
ofJ^. is identifiable up to shift ambiguity. 

Theorem [2] below provides a general identifiability result for the traffic demand tomogra- 
phy mode l, whe re the routing topology is more general than a multicast tree, as studied in 
Cao et al.l (iooS). 



Theorem 2. Let B be the [/(/ + l)/2] x J matrix whose rows consist of the rows of A and 
the component-wise products of each different pair of rows from A. If B has full column rank, 
then the distributions ofX. are identifiable up to shift ambiguity. The shift ambiguity satisfies 
the constraint E[Y] = ^£;[X]. 

Proof. For the convenience of expression, ignore the shift ambiguity. Let AiA^ be the element- 
wise product of Ai and A/^. Notice that (^jAfc)X denotes the common part of (^iX, ^^X), i.e. 
(Yi, Yfc). Since {Xj} are mutually independent, by Lemma 1, the distribution of (^jAfc)X is 
identifiable. Thus the distribution of each component of SX is identifiable. Let ipk denote the 
characteristic function of B^}^, where B^ is the fcth row of B. Then for A; = 1, • • • l)/2 
and for t in a neighborhood of zero, 

log tpk (t) = Yl (t>x, (t) , 

3 

where B^j G {0,1} is the {k,j)th element of B. Since B has full column rank, {log (f)Xjit) '■ 
j = 1, • • • , J} can be uniquely solved from the above linear equations. Then under either 
(CI) or (C2), (pXj is uniquely decided. That is, the distribution of each Xj can be uniquely 
identified. □ 

We now discuss three issues related to the above identifiability results. 
1) Location ambiguity. The location ambiguity of th e toni o graph y problem has been 
recognized in previous works. To avoid such ambi guity, Vard^ 19961 ) assumed a Poisson 



distribution whose mean is the same as its variance, ICao et all l{200d ) used a power relation 
between mean and variance, and all previous discrete link delay models assume probabilities 
starting from zero delay. The important message here is that, despite the location ambiguity. 
Theorem 1 and 2 state that the distributional shape of each Xj can be determined, for 
example, all orders of central moments that exist are uniquely identified. In practice, to 
completely identify the distribution including the location, one can bring in some additional 
information such as the achievable lower bounds of X for example in delay tomography, and 
relationship between mean and variance for example in traffic demand estimation. 
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2) Conditions on the X distribution. The distributional assumption on X is very weak. A 
lot of well known distributions have analytic characteristic functions, such as Poisson, Gaus- 
sian and discrete distributions, which have been used in the literature. A mixture distribution 
that we later use to model the link delays in Section 4 has an analytic characteristic function. 
Although, the heavy-tailed distributions do not satisfy (CI), some heavy-tailed distributions 
such as a-stable distributions satisfy (C2). Despite the generality of our conditions, we do 
note that they are not necessary ones. For theoretical interest, we have constructed a counter 
example of a distribution X = (Xi, X2, X^) that cannot be identified from Y = (Yi,Y2) for 
the simple two- leaf tree model Yi = Xi + X o and Yy, = Xi + X^, as in the Appendix. 



3) Condition on the routing matrix 



h anu j:?. = -;1U- ^3, as m tne i\ppenaix. 
A. ICao et al.l (|200d ^ has shown that the full rank 



condition in Theorem [2] is necessary in the context of traffic demand tomography when X is 
Gaussian. In practice, such a condition is easily satisfied for routing matrices derived from 
realistic network topologies. A more general condition of A has been developed to prove 



the id entifiability for Poisson distributions in the context of traffic demand estimation IVardi 



(|l996l ^. We conjecture that under Vardi's more general condition of A, the distribution of 



X is identifiable up to mean and variance ambiguity under condition (CI), but we leave the 
investigation for future work. 

3 Network delay tomography using mixture modeling 

Below we focus on network delay tomography and describe a class of flexible mixture models 
for modeling link delays. It is well known that there does not exist a stand ard parametric 



model that can sufficieii t ly mo del the distributions of network link delays (see iDuffield et al 



(|200lh and iTsang et al.1 ^200i ) among others). But it is possible to define a mixture model 



which is flexible enough for link delay distributions. 

Assume that for each link j, the link delay Xj follows a mixture density function with nj 
components, Xj ~ fx^ , defined by : 

fx,{x;9j) = ^PjiKji{x), x>0 (4) 
1=1 

where 9j = {pju ' ' ' iPjuj)'^ contains the mixing probabilities with constraint pji > 0, "^iPji = 
1, and {nji} are some basis density functions. There is another practical reason that we use 
a mixture model for link delays: The characteristic function of a mixture distribution is 
a mixture of characteristic functions of the basis distributions and thus can be computed 
conveniently once the basis distributions are chosen appropriately, as shown later. In this 
case, the characteristic function of Xj can be expressed as 

1=1 

where (j)ji is the characteristic function of the basis function Kji. 

The basis functions are chosen as follows for modeling link delays. For j = 1, • • • , J, let 
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= bji < bj2 < ■ ■ ■ < < oo. Define the basis function as 

Kji{x) = point mass at zero (for zero) 
Kjk{x) = uniform on 

2<k<nj-l (for body) (6) 
) = exponential witli scale aj 
on [6j(„_i),cx)] (for tail) 

The point mass at zero link delay is used here because it is well known that for a FIFO queue 
(First In, First Out), the steady state queuing distribution has zero delay with probability one 
minus the utilization of the queue. For the body of the distribution, we choose the piecewise 
uniform model because of its simplicity and flexibility. Finally, an exponential distribution 
is used to model the tail because it is the right model for the short range dependent traffic 
model, and for long-range dependent model it represents a trade-off between accuracy and 
simplicity. 

In order to reduce the computational complexity, we choose the bin endpoints {bjk} in 
advance. The parameter of interest is composed of the mixture coefficients, denoted as 6 = 
6i,--- ,9j. To our advantage, we do not require the bins to be equally spaced. In fact, 
it is important to choose the bins that are adaptive to individual link delay distributions 
in order to obtain accurate estimates. Such a varying bin strategy is especially important 
for a heterogeneous network environment whose link delay distributions vary widely across 
links, because a single bin width value could be at the same time too coarse grained for a 
high bandwidth link with small delays but too fine-graine d to efficientl y captu re the essential 



characteristics of the delay along a low bandwidth link (jPresti et al.l (j2002l )). In addition, 
since a typical delay distribution may have a density varying a lot at different areas, it is 
important to be able to place more bins in the high density area and fewer bins in the low 
density area. 

Note that the equal-bin distribution used by most previous researchers is also a mixture 
model. Figure [3] shows two link delay distributions (in solid lines) where one ranges from 
to 12 (top) and the other from to 240 (bottom). The slashed lines are fitted curves using 
a equal-bin model with bin-width 1 which accomodates the scales of both links, but with 12 
bins for the slow link and as many as 240 bins for the fast link. The dotted lines are fitted 
curves using varying-bin models where only 10 bins are used for both links. It is clear that 
the equal-bin model fits the slow link very well, but not the fast link, while the varying-bin 
model with a small number of bins fits both links well. It is possible to use a very small 
bin-width for the equal-bin model, but it would require too many bins for the slow link. The 
varying bins here are chosen based on quantiles of the delay distribution, which works very 
well in general from our simulation experiences. In reality the quantiles are unknown and 
we can only obtain an approximation using an initial estimate of the link delay distribution. 
This process can be iterated until we get a good estimate. 

The scale parameter aj for the tail basis in Equation ([6|) is unknown and needs to be 
estimated. However, the accuracy of the scale estimate is less important if the endpoint 
of the last bin can be placed at the far end of the tail. For a further simplification, 
we can fix the tail basis with a crude estimate of aj for each link j and only estimate the 
mixing probabilities {pjk}, which is described next. 
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Figure 3: Fitting two link delay distributions using an equal-bin model and a varying-bin 
model, where the delay on the slow link (bottom) is 20 times in average of that on the 
fast link (top): the solid lines are for the link delay density functions, slashed lines for the 
estimated densities using bin-with equal to 1 (12 bins for the fast link and 240 bins for the 
slow one), and the dotted lines for the estimated densities using varying bin-widths (only 10 
bins for each). 



4 Fast algorithms derived from the General Method of Mo- 
ments 

In this section, we discuss how to estimate the parameters of mixture coefficients. It is worth 
pointing out that by Theorem [T] the parameter of the link delay mixture models defined above 
is identifiable when link delays have positive probabilities at zero, which is usually true. 



4.1 The General Method of Moments 



Following HuetaLl (l2002l ). it is not hard to show that the computational complexity of MLE 



using an EM algorithm for the above flexible mixture model is of order O(maxjn^), which 
is too expensive. In this section, we present an estimation approach for network tomograph y 
using Fourier transform, following the pioneering work of iFeuerverger and Mureikal (|l977l ). 
The estimators using this approach can be computed easily as shown below and also exhibit 
good statistical properties. The motivation is that the characteristic function of Y is sim- 
ply the product of the characteristic function of components of X as shown in Equation ([2]), 
though the distribution function is a high order convolution of those of XjS. W e deriv e the 
estimator from the General Method of Moments formally described in iHansen (Has!) and 
Carrasco and Florens ( 2000l ). To be self-contained, we give a formal description of our esti- 
mator for the tomography model below and discuss its advantages over previous approaches. 

Suppose that each Xj is modeled by a probability density function fxj{xj]9j) with an 
unknown parameter Oj. Let 9 = {9j : j = 1, . . . , J}. By Equation ([2]), the joint characteristic 



9 



function of Y is 

J 

where (j)Xj is the characteristic function with respect to fxy Let {Y(n) : 1 < n < A^} be the 
independent measurements of Y. The empirical characteristic function of Y is 

1 ^ 

n=l 

Similar to the maximum likelihood estimate which is derived by minimizing the Kullback- 
Leibler divergence between the empirical distribution and the model distribution of Y, an 
estimate of can be obtained by minimizing an L2 distance between the empirical character- 
istic function and the model characteristic function of Y, i.e., 

6 = argmin j |eAr(t; d^(t), (7) 

where 

e7v(t;0) = ViV(0Y(t)-</.Y(t;0)), 

and /i(t) is a specified probability distribution function on TZ^ (we use the sub script iV to 
show the dependence on the sample size N). 

For a continuous measure /i, the right hand side of d?]) does not have a closed form in 
general. To evaluate the integral, a Monte Carlo approximation can be used: first randomly 
draw K samples from /i(t), say {t^ : A; = 1, 2, • • • , X}, and then replace ^(t) by its empirical 
distribution based on these samples. 

Let e^iO) = (eAr(tfc; 0), fc = 1, • • • , K)^ be a column vector. We can rewrite ([7]) as 

^ = argmin e?;(0)e^(0), (8) 


where e%i{0) is the conjugate of €^{0). We call it the CF-estimator, since it is based on 
characteristic function. 

The CF-estimator can be considered as a least square estimator based on the residuals 
evaluated at ti, . . . ,tK, which are obviously correlated. Let W be the covariance matrix of 
6^(0), it is easy to show that 

= 'i^Yitj - t,,; 6) - cPYitf, e)cj)*^{tk; 9). 
This motivates a weighted version of the CF-estimator, called WCF, 

Q(W) ^ s.Tgminel{e){W + 5NlKr^e*N{0), (9) 


where Ik is the K x K identity matrix and 6]\f, a tuning parameter, is used to make sure 
the inversion is well defined. 6 should be small and we typically choose 6n of order N~^l'^. 
In practice W cannot be calculated precisely since is unknown. We can either use a W 
estimated from an initial estimate of 6 such as the CF-estimator, or iterate this process using 
an iteratively reweighed least squares, which is a common technique used in generalized linear 
models. 
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1) Statistical properties. The characteristic-function based estimators presented in this 
section fall into the class of Generalized Methods of Momen t (GMM) estimators. There is a 
cons iderable body of work on the ir statistical properties, see iFeuerverger and Mureikal (119771 ) 
and Carrasco and Florend ( 20001 ). from which the consistency and asymptotic normality of 
both CF-estimator (IHI) and WCF-estimator (191), can be established. In addition, it has been 



proved by lCarrasco and FlorensI (|2000l ) that when the probability measure fj, in Equation ([7]) 
has a density all over TZ^ , the WCF estimator is asymptotically as efficient as MLE with 
K = oo and an appropriate choice of 5]\f. 

2) Sampling of t. For both the CF and WCF estimators, the points t^^k = 1, . . . ,K are 
sampled based on a. probability measure In gener al, how to choose ^ or sample t efficiently 
is a hard problem (IFeuerverger and Mureikal (119771 )). In the following, we suggest the choice 
of ^ based on our simulation experiences. 

Since the scales of components of Y may be different, we normalize Y by its empirical 
covariance matrix and use an elliptic distribution for ^u, such as Gaussian. From simulations 
we notice that sampling t directly from a probability measure in TZ^ does not easily yield 
good results. This is due to the sparsity of t in the high dimensional space so that the 
characteristic functions 0Y(t) evaluated at most of the points are close to zero. Since the 
variance of the residual eiv(t; 6*) is equal to 1 — |(/)Y(t)|^, the closer to zero of the characteristic 
function |(/)Y(t)p, the larger the variance, and the less the information. Although it may lose 
some efficiency, simulations suggest that better performance can be achieved by sampling 
t from lower dimension subspace, for example 2-dim subspaces. When we draw t from a 
lower dimensional subspace, it implies that we minimize the residuals (difference between 
model and empirical characteristic function) only for these subspaces. This may be viewed 
as a counterpa.rt of the pseudo likelihood approach for network tomography proposed by 
Liang and Yu but in the Fourier domain. 



4.2 A Fast Algorithm by Quadratic Programming 

With the mixture model described in Section [3l the unknown parameter for the model is 



{0, : J = 1, 



, J}, the mixture coefficients. Now we describe how to estimate 9 



iteratively using the approach developed in Section 14.11 

By Equation ^ and dS} , the model characteristic function of Y is 



(t>Y{t-e) = J{eJ^,{t^A^), 



where $j(t) = {4>ji{t), 
defined in Equation ([5] 



• • • ,(j)jnj{t)) . The objective function in obtaining the CF estimate 
can then be written as 



K 

|eAr(tA:; ' 

fc=l 



K 

E 

k=l 



^Y(t,) - n {Gj'^jitiA^)) 

3=1 



It is easy to see that for each 9j, if the rest of the parameters are known, the optimization 
function is a quadratic function of 9j. Specifically, given all other parameters {9i : I = 
1, • • • ,J,l^j}, the optimal 9j can be obtained by minimizing 



C{9j) 



9jDj9j 



29jd„ 



(10) 
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where 



is an Uj x rij matrix, and 



K 

E 

k=l 



Re{^*{tlA^)^j{tiA^)} 



T/.T 



K 



d, = Y.Re{4>*Y{tk)llcl>x,{tU')'^MA')} 

k=i i^j 

is an TT-j-dim column vector. This is a standard quadratic programming problem and can 
be solved quickly. Therefore, estimation of 9 can be obtained by an iterative algorithm as 
follows. 

Algorithm 1. Iterative quadratic programming 

(1) Choose an initial value for 9j, j = 1, . . . , J. 

(2) For each j = 1, . . . ,J, estimate 6j by minimizing U0\) using quadratic programming. 

(3) Repeat Step 2 until convergence. 

A nice property of Algorithm [T] is that it always converges to a local solution because the 
objective function never increases after each iteration and is bounded below by 0. This is sim- 
ilar to EM algorithms, but care is needed in order to obtain the global minimal. Simulations 
show that {pjk = ^/nj} can serve as a good starting value. 

The computational complexity of each iteration in Algorithm [T] is 0{K IJ mayij{nj)^) 
for the CF- estimator. For the WCF-estimator, a similar iterative algorithm by quadratic 
programming can be obtained. Due to the weight matrix, the complexity of each iteration 
becomes 0{K^IJ may: j{nj)^). 



5 Simulation and Experimental Studies 

In Section [H we have developed simple and fast algorithms using a flexible mixture model 
for network delay tomography. In this section, we evaluate the performance of the proposed 
algorithms in terms of statistical efficiency and accuracy. To measure the accuracy of the 
estimation as compared to the true distributions, we use a Li-distance for discrete link delay 
distributions and a normalized Mallows distance for continuous link delay distributions. 

Our evaluation is divided into three pieces. First, we study the efficiency of our estimates 
by comparing them with that of MLE for a discrete link delay distribution with equally 
spaced bins. We show that our estimators have comparable efficiency to that of MLE which is 
statistically efficient and also computable in this setting. Second, we examine the performance 
of our estimators using model simulations for continuous link delays in an ideal scenario where 
both temporal and spatial independence hold. Model simulations demonstrate the importance 
of varying bins selection that should adapt to not only delay distributions of individual links 
but the different scales of delays across links in order to achieve satisfactory estimates. Finally, 
we use real trace driven simulations to examine the accuracy of our estimators under more 
realistic scenarios where the independence assumptions may not be strictly true as appeared 
in the Internet. Results from our trace driven simulations demonstrate that the estimates 
made by our algorithms closely match the real distributions. 
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Figure 4: Estimated link delay probability densities on a four-leaf tree from 500 end-to- 
end delay measurements. The true delay distribution for each individual link has a discrete 
probability density with 6 equally spaced bins. 



5.1 Efficiency Evaluation 

We study the efficiency of our estimators using a discrete link delay distribution with equally 
spaced bins on a four-leaf tree (Figure [2]). For link j,j = 1,...,7, the link delay has a 
discrete distribution at {0, 1,-- - ,5} with probabilities generated uniformly from the space 
Ylk=iPjk = 1 with constraints < pjk < 1. A total of 500 delay samples are generated 
for each link from its specified delay distribution and the end-to-end delays are computed 
according to the model ([T]) . The delay distributions of all seven links are estimated using the 
MLE, the CF-estimator, and the WCF-estimator. 

We repeat the experiment 100 times with different random seeds. Both the MLE and 
the CF-estimator use the uniform distribution as starting values whereas WCF uses the CF 
estimates as starting values. The weight matrix W for WCF is also derived from the CF 
estimates. For both the CF and WCF estimators, a total of 3000 samples of t are drawn 
randomly from the 2-dim subspaces of /-dimensional end-to-end delays using a Gaussian 
distribution with a scal e parameter of 5 aft er normalizing Y. We have also run the recursive 
algorithm developed in IPresti et al.1 tooi ). but we do not report the result here except to 



state t hat it often yields much poorer estimates (similar to observations made by lLiang and Yu 
(|2003l l). 



Figure m shows both the estimated and the simulated seven link delay density functions in 
one simulation experiment. We observe that all methods give reasonably accurate estimates. 
To compare errors of the different estimators, we calculate the Li distances between the 
estimated link delay density functions and the ground truth for each of the 100 experiments. 
Figure reports the 25%, 75% quantiles of the Li errors for each link in vertical line segments, 
whose middle points represent median errors. MLE has the smallest median error, and the 
median errors of CF and WCF are 50% and 22% higher than that of MLE. The results suggest 
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Figure 5: Quartiles of the Li error for the three hnk delay distribution estimates from 100 
simulation runs, where each simulation run has the same setup as in Figure [H The 25%, 75% 
quantiles are shown in line segments, and the median are shown in lines. 



that both CF and WCF are somewhat worse than as expected but comparable to MLE. 
5.2 Accuracy Evaluation Using Model Simulation 

In this subsection, we investigate the performance of the estimators for continuous delay 
distributions, which are more realistic than discrete ones since network delays are essentially 
continuous except at zero. Delay tomography in a heterogeneous network is intrinsically more 
challenging than in a homogeneous network because links with small delays are not equally 
represented as links with large delays in the end-to-end delay measurements. In addition, the 
heterogeneous environment also represents a situation where most of the existing methods 
such as MLE do not work well because they rely on simple discretization. After all, the real 
Internet is a heterogeneous network. Thus we report model simulations on a four-leaf tree 
that resemble a heterogeneous network environment. For simplicity, we do not consider the 
point mass at zero for model simulations, but we will treat this in later real trace driven 
simulations. 

To conduct a comprehensive evaluation, we run simulations for the link delay distributions 
of different shapes. Due to the space limit, we only report the results for two representative 
distributions that are i) exponential (uni-modal) and ii) a mixture of an exponential and a 
Gamma with shape parameter 2 (multi- modal) . In both cases, the average link delays on 
the four-leaf tree are 3, 1, 5, 10, 6, 4 and 20 respectively for link 1 to 7 assigned from top 
to bottom and left to right, which resembles a heterogeneous network with the average link 
delays varying by a factor of 20. We generate 2000 delay samples for each link from the 
specified delay distributions, and we estimate the seven link delay density functions from the 
resulting end-to-end delays. 

We use four different estimates of link delay distributions: CF_equaLbin, WCF_equaLbin, 
CF_varying_bin, WCF_varying_bin. All four estimates are obtained using a mixture model 
for the link delays of the same form as ([6]) with nj = 12 except removing the point mass at 
0. The difference in the mixture model for the estimates lies in the bin placement. For both 
CF_equaLbin and WCF_equaLbin, the 12 bins are equally spaced using a bin width selected for 
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Figure 6: The cumulative probability of the estimated link delay distributions on a four- 
leaf tree (Figure [2]) from 2000 end-to-end delay measurements. The true delay distributions, 
shown in solid line, are exponentials with means 3,1,5,10,6,4 and 20 for link 1 to link 7 on 
the four-leaf tree respectively (where the link index is ordered from top to bottom, left to 
right). Estimates are obtained using a mixture model of piecewise uniform of 12 bins and an 
exponential tail. 



each link based on variance estimates, wh ich are obtained by solving systems of linear equa- 
tions, following iDuffield and Prestil (120041 ). For both CF_varying_bin and WCF_varying_bin, 
the bins are located at the quantiles of the delay distributions that corresponds to probabilities 
i/13,i = 1,... ,12. 

Figure [6] and [7] plot the estimated cumulative distribution functions for each link delay 
for case i) and ii) respectively, along with the ground truth in one simulation run. From 
the figures, we observe that the estimates using varying bins are almost identical to the true 
distributions. The estimates using equal bins give satisfactory estimates for case i) but not 
quite as good for the more complex case ii). 

To measure the accuracy of the estimates, we use the Mallows distance defined for a 
cumulative distribution F and its estimate F by 



M{F, F) 



F-\p)-F-\p) 



-1, 



dp, 



where F ^ and F ^ are the inverse cumulative distributions. The Mallows distance can be 
viewed as the average of absolute difference in quantiles between two distributions. Because 
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Figure 7: The cumulative probability of the estimated link delay distributions on a four-leaf 
tree (Figure[2]) from 2000 end-to-end delay measurements. The true delay distributions, shown 
in solid line, are a mixture of exponential and Gamma distribution with a shape parameter of 
2 with means 3,1,5,10,6,4 and 20 for link 1 to link 7 on the four-leaf tree respectively (where 
the link index is ordered from top to bottom, left to right). Estimates are obtained using a 
mixture model of piecewise uniform of 12 bins and an exponential tail. 



the Mallows distance is linear to the scale of distributions, we use M{F, F) /crp, the normalized 
Mallows distance, to measure the difference between F and F, where ap is the standard 
deviation of F. 

We repeat the simulation 100 times and compute the normalized Mallows distance between 
the estimated and true distributions as the error metric for all links. Figure [8] and [9] report, 
corresponding to case i) and ii) respectively, the first and third quartiles of the errors as well 
as median errors for each link, similar to FigureEl It is clear that the varying bins improve the 
quality of estimates significantly over equal bins. (Note that the difference between CF and 
WCF are not significant though.) This suggests that selecting bins based on characteristics of 
the underlying density distributions is important in improve the accuracy in a heterogeneous 
network. 

5.3 Accuracy Evaluation Using Real Internet Traces 

We next investigate how the algorithms perform in a realistic network environment where 
some of the assumptions may not hold completely. For instance, due to the closed-loop 
control nature of the TCP protocol, the packets within the same TCP connection have strong 
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Figure 8: Quartiles of the normalized Mallows distance for the four link delay distribution 
estimates from 100 simulation runs, where the true delay distribution is an exponential. Each 
simulation run the same setup as in Figure El The first and third quartiles are shown in line 
segments, and the median are shown in lines. 



temporal dependency. Although the dependency is weakened when many TCP connections 
are multiplexed as they arrive to a link, the dependency may not be completely gone. We 
approximate a real scenario by simulating the behavior of a link using the real traces collected 
from the Internet. Since the traces include the arrival time and the size of each packet, the 
simulation sees the exact link behaviors as what the original link where the trace collected 
from seen if we set the bandwidth and the buffers the same as the original link. 

We use traces from the NLANR web site that archives packet header traces collected 
from about ten links at different locations of the Internet. The links differ in both bandwidth 
and traffic. A 90-second trace is recorded every one (two) hours for each of the links. In our 
experiment, we first assign traces collected from different sites to the links of the simulated 
network. We then sim ulate the links using the assigned traces as input using the standard 



network simulator tool iNSi (iNSi ). Moreover, we superpose the probes to the traces and record 
their per-link queuing delays as well as end-to-end delays where the latter is used as input 
for the estimation whereas the prior is for comparison with the estimates. 

Notice that the delays on the edge links in a real network may vary more than the core 
links due to its low bandwidth. In addition, the average delay may also differ dramatically 
for different links. We resemble a real network in a symmetric binary 8- leaf tree by assuming 
that both the root and the leaves of the tree are on the edge of the network whereas the 
interior links are in the core. We assign traces of high rates to links in the core and traces of 
low rates to the edge links. 

Figure [10] shows both CF and WCF estimate of the delay distribution using a varying bin 
strategy laid out in Section [3l along with the simulated distribution. The throughputs across 
different links vary by a factor of 40. It is easy to see that the estimates are extremely good 
for most links, except for link 9 that has the smallest average link delay where it shows some 
marginal error. The average normalized Mallows distance over all links is 0.065 which also 
suggests a good match between the estimates and simulated results. We have also simulated 
the four-leaf tree network and a symmetric binary 16-leaf tree network respecitvley, using 



''http:/ /pma. nlanr.net/Traces/ 
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Figure 9: Quartiles of the normalized Mallows distance for the four link delay distribution 
estimates from 100 simulation runs, where the simulated delay distributions are a mixture of 
an exponential and a Gamma with shape parameter 2. Each simulation run the same setup 
as in Figure [71 The first and third quartiles are shown in line segments, and the median are 
shown in lines. 



different traces, and the proposed algorithms give satisfactory results. 

In addition, our real trace driven simulations suggest that the link delay distributions 
excluding the tail can be well approximated by a Weibull distribution with a shape parameter 
slightly smaller than 1. This is not surprising because it has been shown that the queuing 
delay for a FIFO queue with a Fractional Brownian Motion t raffic inpu t has a Weibullian 
tail. The Weibullian form is also consistent with the finding in ICao et al.l (|2004l ) . 



6 Conclusion 

This paper has presented a general identifiability result and introduced a general estimation 
approach for the network tomography problem. For network delay tomography, a fast algo- 
rithm based on GMM has been developed for estimating the link delay distributions using 
mixture models of characteristic functions. In comparison with likelihood based approaches, 
the most significant nature of the new method is that it affords the choice of varying bin widths 
which adapts to delay variabilities of individual links and has low computational complexity. 
The new approach can be applied to traffic demand estimation as well. 
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Figure 10: The cumulative probability of the estimated link delay distributions on a 8-leaf 
binary tree from 1800 end-to-end delay measurements. The true delay distributions, shown 
in solid line, have an approximate Weibull distribution excluding the zero delay. The average 
link delays vary by a factor of forty. Estimates are obtained using a mixture model that 
consists of a point mass at 0, a piecewise uniform distribution of 6 bins, and an exponential 
tail. 



Appendix: A counter example 

Based on the proof of Lemma 1, we can construct a counter example that the distributions 
of {Xi,X2,X3} are not identifiable. Let c(t;a. A) = e-^l*l/(|t| < a) + Xe'^^ia + j - \t\)I{a < 
\t\ < a + j) he a contin uous function defined for t ^ TZ. It is easy to check using Polya's 
condition ( Lukacsl ( 197d )) that for any a > 0, A > 0, c(t; a, A) is a characteristic function 



corresponding to a symmetric, non-vanishing, bounded continuous density function. Let 
(pxM = c(t;2,l), (t>x[{t) = c(t;3,l), (t) = (t>x',{t) = (pxM = 0x^(t) = c(t;0,l). Both 
groups of distributions corresponding to characteristic functions {(/"Xfe} and {'Pxl^ fo^" {^k} 
respectively can generate the same joint distribution of {Yi,Y2)- Figure [TT] shows both their 
characteristic functions and probability density functions on a two-leaf tree: (pxi and (px^ are 
the two curves in the first box of the second row, and 4>X2 and (j)x3 are in the second and 
third box of the second row respectively, with corresponding density functions plotted in the 
first row. Notice that these distributions cannot be used for link delays which do not permit 
negative values. It is an open question whether there exist link delays whose distribution 
shapes are not identifiable even for the simple two-leaf tree tomography. 
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Figure 11: A counter example of identifiability for the two-leaf tree model where {Xi+X2,Xi + 
X3) and (X( -|- X( -|- X3) have the same joint distribution: The bottom three figures plot 
the characteristic functions: the first one for (pXx ^-iid ^^^d the second and third ones 
for (I)X2 = 4>x'^ aiid = 4>x'^ respectively; The top three figures plot the corresponding 
probability density functions. 
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